(57) Abstract

A method and system for analyzing data over a network are described. A Web server communicates with a storage system that 
stores genomic information in a database. Client systems connect to the Web server over a network, such as the Internet, using standard 
Web protocols (e.g., HTTP). The Web server sends Web pages to the client through which pages the user of the client can load genomic 
information into the database. The client user obtains the genomic information for uploading from genomic samples of organisms hybridized 
to chips or arrays. With the database populated with genomic information, the client user interactively selects and performs an analysis 
on selected samples over the network. The result produced by the analysis is a list of genes or a list of gene lists that becomes part of 
the database. These gene lists or lists of gene lists can then be compared with other previously stored lists or with user-generated and/or 
user-selected gene lists. Accordingly, subsequent users of the database can review the research performed by others, and incorporate that 
research into their own research. 

A METHOD AND RELATIONAL DATABASE MANAGEMENT SYSTEM FOR
STORING, COMPARING, AND DISPLAYING RESULTS PRODUCED BY ANALYSES
OF GENE ARRAY DATA



Related Application 

This application claims the benefit of the filing date of copending U.S. Provisional 
Application, Serial No. 60/134,793, filed May 19, 1999, entitled "Relational Database 
Management System For Gene Array Data," the entirety of which provisional application is 
incorporated by reference herein.

Background of the Invention

Array-based expression analysis tools permit the simultaneous measurement of RNA 
expression levels for all or part of the genome of an organism. Arrays, or "expression chips", 
that probe every ORF (open reading frame) in the yeast genome, as well as those of several other 
organisms, are now commercially available. Chips probing expression levels of up to 10,000 
human genes and ESTs (expressed sequence tags) are also available. The accessibility of 
parallel expression analysis has ushered in a new era of genetic discovery, where the full genetic 
behavior of an organism is measurable in parallel. This widely applicable technology is being 
applied to problems in yeast biology, functional genomics, drug discovery, and other domains. 

Despite the great promise that expression profiling holds for biology research, anyone 
attempting to use array technology quickly discovers that the ability to produce biological data 
does not imply an ability to interpret that data. Consequently, management and interpretation of 
the massive data sets produced by expression analysis tools have become a bottleneck in 
biological research. Techniques used to analyze expression data, which range from pencil and 
paper to computerized spreadsheets, do not provide an adequate means for solving the problems 

WO 00/70556 PCT/US00/13823




database. These gene lists or lists of gene lists can then be compared with other previously 
stored lists or with user-generated and/or user-selected gene lists. Accordingly, subsequent users 
of the database can review the research performed by others, and incorporate that research into 
their own research. 

In one aspect, the invention features a method for analyzing data. The method comprises 
providing data and rescaling the data to produce rescaled data. The rescaled data may be stored 
in the same database as the sample result. The rescaled data is associated with a pre-selected set 
of parameters. A sample set is generated from the associated rescaled data. Analysis is 
performed on the sample set to produce a sample result, and the sample result is stored in a 
database. The stored sample result is associated with a prior result. The prior result can be a 
sample result previously stored in the database, a user-generated result, or a user-selected result. 

In one embodiment, the stored sample result is a list of lists. Each list in the list of lists is 
a list of genes. In another embodiment, the stored sample result is a set of bit vectors. In still 
another embodiment, the associating comprises comparing the sample result with the prior result. 
The results of associating the stored sample result with the prior result may be stored in the database. 

In another aspect, the invention features a system for analyzing data. The system 
includes a calibrator rescaling the data and a pre-selected set of parameters that is associated with 
the rescaled data. A sample set is generated from the associated rescaled data. An analyzer 
performs analysis on the sample set to produce a sample result. A database stores the sample 
result. An associator associates the stored sample result with a prior result. The prior result can 
be a sample result previously stored in the database, a user-generated result, or a user-selected 
result. 

understood that more clients and servers than those shown can be connected to the network 30. 
Although shown in Fig. 1 as separate systems, in another embodiment the client 10 and server 20 
can be the same machine. 

The client 10 can be any personal computer (e.g., 286, 386, 486, Pentium, Pentium II), 
thin-client device, Macintosh computer, Windows-based terminal, Network Computer, wireless 
device, information appliance, RISC Power PC, X-device, workstation, minicomputer, mainframe 
computer, or other computing device that has a graphical user interface. Windows-oriented 
platforms supported by the client 10 can include Windows 3.x, Windows 95, Windows 
98, Windows NT 3.51, Windows NT 4.0, Windows CE, Windows CE for Windows Based 
Terminals, Macintosh, Java, and Unix. The client 10 includes conventional hardware for 
supporting a display screen, a keyboard, memory, a processor, and an input/output device (e.g., a 
mouse). 

The client 10 also has software including browser software 12, e.g., Microsoft Internet 
Explorer™ produced by Microsoft Corporation of Redmond, Washington. The browser software 
12 provides a graphical user interface to the server 20. Through the Web browser, the client 10 
develops and submits search requests for retrieving data from the storage system 40. In general, 
the user of the client formulates queries of the storage system 40, using the keyboard and the 
input device to point and click on graphical buttons, pull-down menus, scroll bars, etc., that are 
then submitted to the server 20 over the network 30. 

Server 20 includes the hardware necessary for running software to access information in 
the storage system 40 in response to client user requests, and for providing an interface for 
transmitting information to the client 10. In one embodiment, the server 20 operates as a Web 
server 32, supporting the World Wide Web protocol (e.g., HTTP protocol) for providing page 
data to the client 10, maintaining Web pages, processing URLs, and controlling access to other 
portions of the network 30 (e.g., workstations, storage systems, printers) or to other networks. In 
one embodiment, the server 20 is a 233 MHz Pentium II running on a Windows NT 4.0 
workstation. In another embodiment that improves multi-user performance, the server 20 is an 
Ultra-4 Sparc workstation running the Solaris 2.6 operating system with four 400 MHz 
processors and 1 GB of RAM (produced by Sun Microsystems). 

As shown, the server 20 includes the World Wide Web server 32, a World Wide Web 
interface 34, and a database management system (DBMS) 36. The Web interface 34 includes the 
executable code necessary for generating queries that access information in the storage system 
40 (e.g., statements in a language such as Structured Query Language (SQL)). The 
Web interface 34 also includes Web applications written in PL/SQL, Perl, and Java. One 
Web application enables the client user to directly upload genome expression data files into the 
storage system 40 (hereafter called the loader 35). Other Web applications provide a Web 
interface to the storage system 40 and perform data analysis such as normalization and 
comparisons between an unlimited number of experiments and functional categorization of an 
organism's genes. 

In general, the database management system (DBMS) 36 serves as a Web-based search 
engine that enables the client user to search for any number of genes according to user-specified 
keywords in names or gene description. The search engine also operates to find and download 

embodiment, the DBMS 36 is an Oracle™ DBMS 36 with WebDB, which is a product produced 
by Oracle for implementing dynamic HTML (Hypertext Markup Language). 

The storage system 40 can be any of a variety of systems that maintain information 
including, for example, a database server, a file storage system having large binary files, or a legacy 
mini-computer or main-frame computer with storage. In one embodiment, the storage system 40 
includes a relational database 44 in which the information is stored in a relational format. The 
relational database 44 includes tables of columns and rows for holding the information stored in 
the database 44. Each table has a primary key that is any column or set of columns storing a 
value or values which uniquely identify the rows in that table. The tables of the relational 
database 44 can also include a column or set of columns that function as a secondary key. The 
values of secondary key columns are used to match the primary key values of another table. The 
relational database 44 supports a set of operations that are performed on the relations within the 
database 44. 

Implementation of the relational database 44 of the storage system 40 can be 
accomplished in various ways. For example, one embodiment of the relational database 44 is an 
Oracle™ database. An example of another embodiment of the relational database 44 is a 
Sybase™ database. 

The network 30 can be a local-area network (LAN), an intranet, or a wide area network 
(WAN) such as the Internet or the World Wide Web. A user of the client 10 can be connected to 
the network 30 through a variety of connections including standard telephone lines, LAN or 
WAN links (e.g., T1, T3, 56kb, X.25), broadband connections (ISDN, Frame Relay, ATM), and 
wireless connections. The connections can be established using a variety of communication 
protocols (e.g., HTTP, TCP/IP, IPX, SPX, NetBIOS, Ethernet, RS232, and direct asynchronous 
connections). 

generates a new document 38 containing the database information and transmits the new 
document 38 to the client 10 where the database information is displayed in the graphical user 
interface 14. 

Fig. 2 shows an embodiment of a process for accessing information in the database 44 
according to the principles of the invention. The client user uploads (step 100) raw data into the 
database 44. In one embodiment, the data is genomic data. Other types of data can be used to 
practice the principles of the invention. The raw genomic data is obtained from "chips" (or 
"arrays"). A chip is a solid substrate with DNA probes that are either synthesized or spotted 
onto the substrate surface in a grid layout. Chips may contain from a few hundred to tens of 
thousands of probes, each of which corresponds to a single nucleotide sequence of interest. A 
nucleotide sequence in turn corresponds to a genetic feature of interest, such as the coding for a 
specific protein. For example, a probe may refer to an mRNA strand that codes for a specific 
protein or amino acid sequence. Other non-mRNA probes are also placed on chips, so a 
nucleotide sequence may refer to a region upstream of a gene, or to a mitochondrial mRNA or 
other genetic material. For example, the Affymetrix GeneChip™ platform determines raw 
genomic data as the average difference score and present call (i.e., a measure of the presence or 
absence of a message) for each probe set on the array. In one embodiment, multiple 
measurements per spot, including the average intensity and background values for each set of 
probes on the array, are supported. 

As used hereafter, a data set includes the genomic data that are obtained from the 
hybridization of one sample to a set of chips that span the genome of the organism (or some 
subset of the genome). A sample refers to a colony of cells grown from a particular genetic 
strain of organism (e.g., yeast) that has a particular genotype. Thus, the database services of the 
invention handle each sample independently. 

The Web interface 34 produces (step 112) a sample set using the rescaled samples. A 
Web application of the Web interface 34 performs (step 116) a user-specified analysis on the 
sample set. As described in more detail below, one embodiment offers two types of analysis: (1) 
rule-based analysis, and (2) non-hierarchical clustering analysis. 

Execution of the user-specified analysis produces a result (hereafter "sample result"). In 
one embodiment, the sample result is a list of genes (i.e., a "gene list") that are co-expressed in 
some way. An exemplary representation of a list of genes is: 

Sample Result: 
    gene 1 
    gene 2 
    gene 3 

In another embodiment, the sample result is a list of lists of genes (i.e., a list of gene 
lists). An exemplary representation of a list of lists of genes is: 

Sample Result: 
    Gene List for Result Type 1 
        gene 1 
        gene 2 
    Gene List for Result Type 2 
        gene 3 
        gene 4 

In still another embodiment, the sample result is a set of bit vectors. An exemplary 
representation of a set of bit vectors is: 

Sample Result: 

Result Type 1 Result Type 2 Result Type 3 

gene 1 x x 

gene 2 x x x 

gene 3 x 

gene 4 x x 
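
By way of illustration only, the relationship between a list of gene lists and a set of bit vectors can be sketched in Python as follows. The sketch is not the storage format of the described system: the function name `to_bit_vectors` is hypothetical, and it assumes one bit per result type for each gene, set when the gene appears in that result type's gene list. The input mirrors the list-of-lists example above.

```python
from typing import Dict, List


def to_bit_vectors(gene_lists: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Convert a list of gene lists into one bit vector per gene.

    Bit i of a gene's vector is 1 when the gene appears in the i-th
    result type (in insertion order of gene_lists), otherwise 0.
    """
    result_types = list(gene_lists)
    vectors: Dict[str, List[int]] = {}
    for column, result_type in enumerate(result_types):
        for gene in gene_lists[result_type]:
            # Create an all-zero vector the first time a gene is seen.
            vectors.setdefault(gene, [0] * len(result_types))[column] = 1
    return vectors


# Hypothetical input taken from the list-of-lists representation.
sample_result = {
    "Result Type 1": ["gene 1", "gene 2"],
    "Result Type 2": ["gene 3", "gene 4"],
}
bit_vectors = to_bit_vectors(sample_result)
```

Under this assumed encoding, a gene that belongs to several result types simply has several bits set, which is what allows one row per gene in the tabular representation.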

Other embodiments of a sample result also include information that is associated with the 
genes in the gene list. For example, each gene can be associated with a scalar value representing 
a confidence metric for that gene (e.g., a scalar value of 1 means information about the gene is 

this association is a comparison between the stored sample result and a prior result. The 
comparison in one embodiment looks for genes that appear in both the stored sample result and 
the prior result. 

The prior result can be another sample result derived from a previous analysis performed 
on the information in the database 44 or the prior result can be a user-created or predefined list 
stored in the database 44. An example of a predefined list is a MIPS-generated categorization 
list. MIPS stands for the Munich Information Center for Protein Sequences and is a 
bioinformatics group that publishes various functional categorizations of genes on the Internet. 
The following is an example of a small portion of the functional categorizations of yeast genes 
published by MIPS: 

TRANSCRIPTION (751 ORFs) 
    rRNA transcription (100 ORFs) 
        rRNA synthesis (39 ORFs) 
        rRNA processing (58 ORFs) 
        other rRNA-transcription activities (3 ORFs) 
    tRNA transcription (82 ORFs) 
        tRNA synthesis (24 ORFs) 
        tRNA processing (37 ORFs) 
        tRNA modification (16 ORFs) 
        other tRNA-transcription activities (4 ORFs) 
    mRNA transcription (544 ORFs) 
        mRNA synthesis (410 ORFs) 
            general transcription activities (64 ORFs) 
            transcriptional control (326 ORFs) 
            chromatin modification (32 ORFs) 
        mRNA processing (splicing) (91 ORFs) 
        mRNA processing (5'-, 3'-end processing, mRNA degradation) (37 ORFs) 
        other mRNA-transcription activities (10 ORFs) 
    RNA transport (27 ORFs) 
    other transcription activities (58 ORFs) 

PROTEIN SYNTHESIS (347 ORFs) 
    ribosomal proteins (206 ORFs) 
    translation (initiation, elongation and termination) (62 ORFs) 
    translational control (30 ORFs) 
    tRNA-synthetases (37 ORFs) 
    other protein-synthesis activities (15 ORFs) 




To keep data set load times to a minimum, and thus provide acceptable interactive 
response to the client user, the loader 35 inserts raw data row by row into an empty temporary 
table. The loader 35 then selects and inserts the raw data at once into a large table containing all 
data sets. In one embodiment, this large table contains 1.6 x 10^ rows. This load optimization 
technique improves insert times and reduces rollback space consumption considerably. Also, the 
optimization technique causes insert times to be proportional to the size of the data set being 
inserted rather than the size of the table. 
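
By way of illustration only, the two-stage load described above can be sketched in Python with an in-memory SQLite database standing in for the DBMS 36. The table names `staging` and `all_data`, the column names, and the function `load_data_set` are assumptions made for this sketch and are not taken from the described schema.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, int, float]  # (file_id, probe_id, avgdiff), hypothetical layout


def load_data_set(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Stage rows in an empty temporary table, then copy them in one statement."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE staging (file_id INTEGER, probe_id INTEGER, avgdiff REAL)"
    )
    # Stage 1: row-by-row inserts touch only the small, empty temporary table.
    for row in rows:
        cur.execute("INSERT INTO staging VALUES (?, ?, ?)", row)
    staged = cur.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
    # Stage 2: a single INSERT ... SELECT moves the whole data set at once.
    cur.execute(
        "INSERT INTO all_data SELECT file_id, probe_id, avgdiff FROM staging"
    )
    cur.execute("DROP TABLE staging")
    conn.commit()
    return staged


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE all_data (file_id INTEGER, probe_id INTEGER, avgdiff REAL)"
)
loaded = load_data_set(conn, [(100, 200, 374.0), (101, 201, 258.0)])
```

The sketch shows only the order of operations; the insert-time and rollback-space benefits reported above depend on the DBMS and are not reproduced by SQLite.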

Rescaling data sets 

Before data sets for different chips can be analyzed together, calibration or rescaling of 
the raw data in the data sets is necessary. The rescaling can be performed in a variety of ways 
depending on the nature of the experiment. For example, known quantities of exogenous control 
RNAs can be used for rescaling data values read from one chip to those read from another chip. 
For experiments in which the overall mRNA population is expected to remain stable, bulk signal 
scaling methods can also be employed. In situations where overall expression is significantly 
affected, for example when parts of the transcription apparatus are knocked out or inactivated 
because of temperature-sensitive mutations, control-based rescaling is appropriate. Still 
referring to Fig. 3, the loader 35 allows the client user to choose the rescaling method (by 
specifying a reference set in field 135) and associated parameters when a data set is loaded. The 
loader 35 also provides a set of default options (in field 137) that represent the typical parameters 
for rescaling. 
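
By way of illustration only, one simple form of bulk signal scaling can be sketched in Python as follows. The specific algorithm is an assumption made for this sketch, namely a single multiplicative factor that maps a chip's mean signal onto the mean signal of the control chip; the text above does not specify the scaling algorithms actually used, and the function names and intensity values are hypothetical.

```python
from statistics import mean
from typing import List


def bulk_scale_factor(control_values: List[float], chip_values: List[float]) -> float:
    """Factor that maps the chip's mean signal onto the control chip's mean."""
    return mean(control_values) / mean(chip_values)


def rescale(chip_values: List[float], factor: float) -> List[float]:
    """Apply one multiplicative factor to every measurement on a chip."""
    return [value * factor for value in chip_values]


# Hypothetical intensities: the second chip reads uniformly twice as bright.
control_chip = [100.0, 200.0, 300.0]
other_chip = [200.0, 400.0, 600.0]
factor = bulk_scale_factor(control_chip, other_chip)
rescaled = rescale(other_chip, factor)
```

A scheme of this kind presumes a stable overall mRNA population, which is why control-based rescaling is preferred when overall expression is significantly affected.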

To implement rescaling, a reference set is defined to include a sample used as a control 
for rescaling, a rescaling algorithm and any parameters that the rescaling requires, and a set of 
samples whose chips are rescaled to the chips from the control sample. Currently all available 
rescaling algorithms are stable with respect to the contents of the reference set; that is, adding 

by a chip probe. To map from physical chip probes to genetic features, a scheme is employed 
that chooses the "best" probe on a chip for each genetic feature that is represented, based on a 
set of empirically chosen rules. Additionally, to make cross-technology comparisons (e.g., between 
chips from different manufacturers), a unique gene catalog describing every gene queried by a chip is 
used so that measurements of the same gene described under two different accession numbers 
can still be compared. 

Data Retrieval 

After loading and rescaling data sets, the client user can extract information from the 
database 44 using a retrieval tool (i.e., a Web application on the server 20) that allows the client 
user to select a set of genes across a set of samples, and download the resulting matrix as text or 
as an HTML table. The client user can load the resulting file into a spreadsheet for local (i.e., 
client 10) analysis. 

Data Organization - Projects and Gene Categories 

To organize the information stored in the database 44, the data used in analyses are 
divided into projects. Each project contains a sample set, which is a group of related samples 
derived from the same reference set. These sample sets can then be analyzed to produce a set of 
results (i.e., a sample result). Each sample result can contain a list of genes or a list of gene lists 
and numeric values that describe that gene list, such as, for example, a centroid. Presumably the 
genes in a gene list are those genes that were co-expressed in an experiment. Each project is 
associated with an individual (e.g., a researcher). In the schema of the database 44, described 
below in connection with Fig. 6, each project is an entry in the PROJECTS table. 

Groups of genes 

Another mechanism for organizing the information in the database 44 is to place genes 
into user-defined categories. The categories can then be placed into groups. The MIPS 
functional catalogues described above are an example of this organizational mechanism. As 
described in more detail in the Data Mining section below, these user-defined lists of genes can 
be compared with lists of genes (or lists of gene lists) that are produced by user-specified 
analyses. 

Data extraction 

The manner of storage of the information in the database 44 facilitates extraction of the 
data sets for external analysis (i.e., local analysis) by the client user (e.g., using a spreadsheet). 
Further, the client user can extract data sets for multiple samples across a group of features. Set 
operations (i.e., AND, OR, etc.) on features are also supported. For example, the set of genes 
up-regulated across a particular time course experiment can be combined with those genes that 
were down-regulated. The resulting combined set of rows can be extracted across the samples 
involved in the particular time course experiment or some other time course experiment for 
external analysis. 
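
By way of illustration only, the set operations on features described above can be sketched in Python as follows. The gene labels, the expression values, and the variable names are hypothetical; Python's built-in set operators stand in for the AND and OR operations.

```python
from typing import Dict, List, Set

# Hypothetical expression matrix: gene -> values across the samples of a time course.
expression: Dict[str, List[float]] = {
    "gene 1": [1.0, 2.1, 4.3],
    "gene 2": [5.0, 2.4, 1.1],
    "gene 3": [3.0, 3.1, 2.9],
}

up_regulated: Set[str] = {"gene 1"}
down_regulated: Set[str] = {"gene 2"}

# OR: genes that were either up-regulated or down-regulated.
changed = up_regulated | down_regulated
# AND: genes that appear in both sets (empty for this data).
both = up_regulated & down_regulated

# Extract the combined set of rows across the samples for external analysis.
extracted = {gene: expression[gene] for gene in sorted(changed)}
```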

Data Set Analysis 

To analyze the data sets stored in the database 44, the client user groups samples into 
sample sets. As described above, all samples in a sample set are from the same reference set, 
and sample sets are stored under projects for data organizational purposes. An analysis produces 
a comparison of the samples in the sample set to derive multiple lists of genetic features whose 
expression has been affected in some particular way. In a previously noted embodiment, sample 
sets can be analyzed using one of two tools: rule-based analysis and non-hierarchical clustering. 

Rule-based analysis 

Within the sample set, each sample plays a role, e.g., wild-type replica 0, time point 15' 
replica 1. Replicas are repeated experiments which can be used by analysis to control for 
experimental noise. After assigning roles to the samples, the client user chooses the rules to 
apply to the analysis of those samples. The client user selects the rules to apply from a set of 
predefined rules. The Web interface 34 then executes the selected rules in the DBMS 36 to 
produce a list or lists of affected genes. This sample result is then stored in the database 44, 
available to subsequent searches by client users. 

Rule-based analysis allows the user to choose a set of predefined rules that determine 
which genes are co-expressed. An example of a rule is "all ORFs whose expression levels 
change by a factor of 2." An example of another rule is "all ORFs whose average expression 
levels across replicates monotonically increase over time and for which at least half of the 
measurements for each time point are of high confidence." Fig. 4 shows a screen shot of an 
exemplary graphical user interface 140 presented to the client user to perform a rule-based 
analysis. 
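
By way of illustration only, the two example rules can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: "change by a factor of 2" is read as a ratio of at least 2 in either direction, the monotonic rule is reduced to a strict increase of the replicate averages (the confidence condition is omitted), and all function names, ORF labels, and values are hypothetical.

```python
from typing import Dict, List, Sequence


def changed_by_factor(
    control: Dict[str, float], treated: Dict[str, float], factor: float = 2.0
) -> List[str]:
    """Return ORFs whose treated/control ratio is >= factor or <= 1/factor."""
    selected = []
    for orf, control_level in control.items():
        ratio = treated[orf] / control_level
        if ratio >= factor or ratio <= 1.0 / factor:
            selected.append(orf)
    return selected


def monotonically_increasing(replicates_by_time: Sequence[Sequence[float]]) -> bool:
    """True when the replicate average strictly increases at each time point."""
    averages = [sum(values) / len(values) for values in replicates_by_time]
    return all(earlier < later for earlier, later in zip(averages, averages[1:]))


# Hypothetical expression levels for three ORFs.
control = {"ORF A": 100.0, "ORF B": 100.0, "ORF C": 100.0}
treated = {"ORF A": 250.0, "ORF B": 40.0, "ORF C": 120.0}
affected = changed_by_factor(control, treated)

# Hypothetical time course: two replicates at each of three time points.
rising = monotonically_increasing([[1.0, 1.2], [2.0, 2.2], [3.5, 3.1]])
```

Each rule yields either a gene list or a per-gene decision, so the outputs of several rules can be stored together as a list of gene lists.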

In one embodiment, rule-based analysis is implemented as an external module that uses the R 
package of statistical programs, which is an implementation of the S programming language for 
mathematical modeling, and interacts with the database 44 through the DBMS 36. The R 
language is described in Ihaka & Gentleman (1996), "R: A Language for Data Analysis and 
Graphics", Journal of Computational and Graphical Statistics, 5, 299-314. CGI programs, 
written in Perl, control the R programs to provide a graphical user interface. Analyses written 
in R can extract a matrix of values from the database 44 corresponding to expression levels 
across a sample set, and determine which genetic features are co-regulated. The R programs 
directly load the results of the rule-based analysis into the database 44. 

type) cells. It is reported as a positive number if the ratio is >=1, and as the negative reciprocal 
of the ratio if it is <1. Additionally, the R package of programs provides a set of plotting tools 
for visualizing the data. For example, some R programs plot histograms of log fold changes 
between chips or samples. 
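
By way of illustration only, the signed reporting convention described above (the ratio itself when it is at least 1, otherwise the negative reciprocal of the ratio) can be sketched in Python as follows. The function name `signed_fold_change` is hypothetical, and the sketch assumes that the ratio is formed by dividing a measured value by a strictly positive reference value.

```python
def signed_fold_change(value: float, reference: float) -> float:
    """Report a ratio as-is when >= 1, else as the negative reciprocal.

    Assumes both arguments are strictly positive.
    """
    ratio = value / reference
    if ratio >= 1:
        return ratio
    return -1.0 / ratio


increase = signed_fold_change(300.0, 100.0)   # ratio 3.0 is reported unchanged
decrease = signed_fold_change(50.0, 100.0)    # ratio 0.5 is reported as -2.0
unchanged = signed_fold_change(100.0, 100.0)  # ratio 1.0 is reported as 1.0
```

With this convention a two-fold increase and a two-fold decrease have the same magnitude and opposite signs, and no reported value falls strictly between -1 and 1.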

Data Mining 

The above-described analysis and visualization tools allow client users to seek answers to 
questions involving a small number of samples. In accordance with the principles of the 
invention, the client user can also seek answers to questions that encompass different data sets or 
the entire database 44. As described below, the ability to compare different lists of genes 
provides a data mining capability. 

As described above, sample results are stored in the database 44 as a set (i.e., list) of 
genes. Consequently, any user of a client connected to the server 20 can browse and 
search through results produced by the analyses of other client users. Such searches can be for genes by 
name, strain, sample, or condition, or by gene membership. For example, a client user can obtain 
answers to queries such as "what analyses showed a change in expression for gene X". 

After the sample results are stored into the database 44, the client user can also compare 
those sample results with other previously stored sample results. Further, such stored sample 
results can be compared with other lists of genes, for example, user-defined gene lists or 
literature-derived classifications of genes, such as the MIPS functional catalogues. This 
capability enables the comparison of sample results to external information, such as knowledge 
extracted from scientific literature. The client user can categorize such knowledge based on 
whatever criteria they choose. These user-defined categorizations have a particular format 
adapted to facilitate comparisons with sample results stored in the database 44. The particular 
format follows a semi-hierarchical scheme for representing information, such as the MIPS 

or sample result appears in the respective box. Accordingly, the client user can initiate one of 
three types of comparisons: (1) a prior result with a prior result, (2) a prior result with a sample 
result, and (3) sample result with a sample result. Upon selecting the "Submit Query" button 
160, a comparison is performed between the two selected results. 

Examples of queries that the client user can attempt to answer through the interface 150 
are "which genes that are up-regulated under condition X encode for members of the ribosomal 
complex?" and "which conditions show considerable overlap with enzymatic activity Y?" Such 
data mining queries involve set comparisons and are implemented as partially constrained 
Cartesian products in SQL. 
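
By way of illustration only, a set comparison expressed as a partially constrained Cartesian product can be sketched in Python with an in-memory SQLite database. The sketch assumes a single table of (gene, result_id) pairs modeled loosely on the GENE_IN_LIST table; the lowercase column names, the sample rows, and the query itself are assumptions, not the queries of the described system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_in_list (gene TEXT, result_id INTEGER);
    INSERT INTO gene_in_list VALUES
        ('gene 1', 1), ('gene 2', 1), ('gene 3', 1),
        ('gene 2', 2), ('gene 3', 2),
        ('gene 4', 3);
    """
)

# The table is crossed with itself; the product is constrained only by
# gene equality (and by a.result_id < b.result_id to report each pair once).
OVERLAP_QUERY = """
SELECT a.result_id, b.result_id, COUNT(*) AS shared_genes
FROM gene_in_list AS a, gene_in_list AS b
WHERE a.gene = b.gene
  AND a.result_id < b.result_id
GROUP BY a.result_id, b.result_id
ORDER BY a.result_id, b.result_id
"""
overlaps = conn.execute(OVERLAP_QUERY).fetchall()
```

Each returned row names two gene lists and the number of genes they share; pairs of lists with no gene in common do not appear in the result.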

Other tables in the schema 200 include a REFERENCE_SET table 214, a 
SAMPLE_IN_REFERENCE_SET table 216, an ABS_EXPRESSION table 218, and an 
ABS_DATA_TAB table 220. The REFERENCE_SET table 214 groups samples that have been 
rescaled together using the same set of parameters and a single control sample. Each sample 
other than the control sample is rescaled using parameters and the values associated with the 
control sample. The SAMPLE_IN_REFERENCE_SET table 216 maintains the relationships 
between samples and reference sets. The SAMPLE_IN_REFERENCE_SET table 216 includes 
a Reference_set_ID attribute that is a secondary key for searching the REFERENCE_SET table 
214 and a Sample_ID attribute that points to the SAMPLES table 208. 

The ABS_EXPRESSION table 218 stores an entry for every chip that is inserted into a 
reference set. Attributes of the ABS_EXPRESSION table 218 store information describing the 
rescaling, such as scale factor and reference chip. The ABS_DATA_TAB table 220 stores 
rescaled data values and points to the SAMPLE_IN_REFERENCE_SET table 216. 

Still other tables in the schema 200 include a SAMPLE_SET table 222, an 
ANALYSIS_RESULTS table 224, a GENE_IN_LIST table 226, a PROJECTS table 228, a 
SAMPLE_IN_PROJECTS table 230, a SAMPLE_IN_SSET table 232, and an 
ANALYSIS_PARAMETERS table 234. 

The SAMPLE_SET table 222 groups samples that are analyzed together. In one 
embodiment, all samples in a sample set come from the same reference set. The 
ANALYSIS_RESULTS table 224 holds the sample result sets generated by an analysis. There is 
one entry in the ANALYSIS_RESULTS table 224 for each sample result produced by an 
analysis. Note that one analysis may produce multiple gene lists (thus, the sample result is a list 
of gene lists). The ANALYSIS_PARAMETERS table 234 identifies the parameters used to 

that data. In this example, there are four data files, one for each of the four chips associated with 
one sample. Each data file contains one or more measurements of interest per probe located on 
the array. The loader 35 uploads each data file into multiple tables, including the TSV_RAW 206 
and TSV_FILES 204 tables. The TSV_FILES table 204 then contains one row for each data set 
loaded. The TSV_RAW table 206 contains one row for each probe present in the data file, as 
shown for example in TABLE 1 below: 



TABLE 1 

TABLE NAME: TSV_RAW 

FILE_ID    PROBE_ID    AVGDIFF    CONF 
100        200         374        P 
101        201         258        P 



Using the SAMPLE_ON_CHIP table 202, the data set is associated with sample 
information describing the sample and the chip (array) on which the sample was hybridized, as 
shown in TABLE 2 below: 



TABLE 2 

TABLE NAME: SAMPLE_ON_CHIP 

SAMPLE_ID    FILE_ID    CHIP_ID 
300          100        400 
300          101        401 



Then the loaded data is rescaled with respect to a pre-defined set of rescaling parameters 
(reference set). The rescaling constants for each data file are stored in the ABS_EXPRESSION 
table 218, as shown in TABLE 3 below: 
TABLE 3 

TABLE NAME: ABS_EXPRESSION 

RESCALED_SAMPLE_ID    CHIP_ID    FACTOR 
500                   400        0.60953 
500                   401        0.78251 
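
By way of illustration only, the way the rows of TABLES 1 to 3 relate to one another can be sketched in Python with an in-memory SQLite database. The sketch assumes that the stored FACTOR is applied multiplicatively to AVGDIFF, which the text does not state; the lowercase table and column names and the query are likewise assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tsv_raw (file_id INTEGER, probe_id INTEGER, avgdiff REAL);
    CREATE TABLE sample_on_chip (sample_id INTEGER, file_id INTEGER, chip_id INTEGER);
    CREATE TABLE abs_expression (rescaled_sample_id INTEGER, chip_id INTEGER, factor REAL);

    -- Rows copied from TABLES 1, 2 and 3.
    INSERT INTO tsv_raw VALUES (100, 200, 374), (101, 201, 258);
    INSERT INTO sample_on_chip VALUES (300, 100, 400), (300, 101, 401);
    INSERT INTO abs_expression VALUES (500, 400, 0.60953), (500, 401, 0.78251);
    """
)

# Follow file_id to chip_id, then chip_id to the rescaling factor.
RESCALE_QUERY = """
SELECT r.probe_id, r.avgdiff * e.factor AS rescaled_value
FROM tsv_raw AS r
JOIN sample_on_chip AS c ON c.file_id = r.file_id
JOIN abs_expression AS e ON e.chip_id = c.chip_id
ORDER BY r.probe_id
"""
rescaled_rows = conn.execute(RESCALE_QUERY).fetchall()
```

Under the multiplicative assumption, each raw value is paired with the factor of the chip on which its data file was measured, yielding one rescaled value per probe.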

replicate 2 samples. Various other types of comparisons are possible. For example, another 
method for comparing the samples in the sample set is to average the mutant replica values and 
to divide that average by the average of the wildtype values. 

The selected analysis is performed and the sample results are stored. In this example, the 
analysis performed compares the average expression level of the control samples to that of the 
test samples for each gene, determining if the genes differ by more than a factor of 2 either up or 
down. If the test samples are at least 2 times (2X) the control samples, the gene is assigned to 
the "up" result. If test samples are at least 2X lower, then the gene is assigned to the "down" 
result. Referring to TABLE 6 below, the selected analysis (here, ANALYSIS ID 900) illustrates 
an example of an analysis that can produce multiple lists of genes (i.e., a list of lists): one list for 
"up" genes, and another list for "down" genes. 



TABLE 6 

TABLE NAME: ANALYSIS_RESULTS 

ID     ANALYSIS_ID    NAME 
800    900            up 
801    900            down 



As shown in TABLE 7, the GENE_IN_LIST table 226 associates each gene with the 
appropriate result(s) for that gene: 

TABLE 7 

TABLE NAME: GENE_IN_LIST 

GENE       RESULT_ID 
YOR095C    801 
YFL014W    800 



Now answers to questions such as "which genes were in result "up" in analysis x and in 
analysis y" can be provided by the database 44. In the present example, the gene YFL014W is a 
gene with an "up" result. 
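
By way of illustration only, the lookup described above can be sketched in Python with an in-memory SQLite database populated from TABLES 6 and 7. The lowercase table and column names and the helper function `genes_in_result` are assumptions made for this sketch rather than part of the described schema.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE analysis_results (id INTEGER PRIMARY KEY, analysis_id INTEGER, name TEXT);
    CREATE TABLE gene_in_list (gene TEXT, result_id INTEGER REFERENCES analysis_results(id));

    -- Rows copied from TABLES 6 and 7.
    INSERT INTO analysis_results VALUES (800, 900, 'up'), (801, 900, 'down');
    INSERT INTO gene_in_list VALUES ('YOR095C', 801), ('YFL014W', 800);
    """
)


def genes_in_result(conn: sqlite3.Connection, analysis_id: int, name: str) -> List[str]:
    """Return the genes assigned to the named result of one analysis."""
    rows = conn.execute(
        """
        SELECT g.gene
        FROM gene_in_list AS g
        JOIN analysis_results AS a ON a.id = g.result_id
        WHERE a.analysis_id = ? AND a.name = ?
        ORDER BY g.gene
        """,
        (analysis_id, name),
    ).fetchall()
    return [gene for (gene,) in rows]


up_genes = genes_in_result(conn, 900, "up")
down_genes = genes_in_result(conn, 900, "down")
```

A question spanning two analyses, such as which genes were "up" in both analysis x and analysis y, then reduces to intersecting the two returned lists, for example `set(genes_in_result(conn, x, "up")) & set(genes_in_result(conn, y, "up"))`.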

While the invention has been shown and described with reference to specific preferred 
embodiments, it should be understood by those skilled in the art that various changes in form and 
detail may be made therein without departing from the spirit and scope of the invention as 
defined by the following claims. 

a pre-selected set of parameters associated with the rescaled data; 

a sample set generated from the associated rescaled data; 

an analyzer performing analysis on the sample set to produce a sample result; 

a database storing the sample result; and 

an associator associating the stored sample result with a prior result. 

13. The system of claim 12 wherein the prior result is a sample result previously stored in the 
database. 

14. The system of claim 12 wherein the prior result is a user-generated result. 

15. The system of claim 12 wherein the prior result is a user-selected result. 